{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rL-LzAqpoGLC"
   },
   "source": [
    "# Ungraded Lab: Building a Vocabulary\n",
    "\n",
    "In most natural language processing (NLP) tasks, the initial step in preparing your data is to extract a vocabulary of words from your corpus (i.e. input texts). You will need to define how to represent the texts into numeric features which can be used to train a neural network. Tensorflow and Keras makes it easy to generate these using its APIs. You will see how to do that in the next cells."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-nt3uR9TPrUt"
   },
   "source": [
    "The code below takes a list of sentences, then takes each word in those sentences and assigns it to an integer. This is done using the [TextVectorization()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) preprocessing layer and its [adapt()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization#adapt) method.\n",
    "\n",
    "As mentioned in the docs above, this layer does several things including:\n",
    "\n",
    "1. Standardizing each example. The default behavior is to lowercase and strip punctuation. See its `standardize` argument for other options.\n",
    "2. Splitting each example into substrings. By default, it will split into words. See its `split` argument for other options.\n",
    "3. Recombining substrings into tokens. See its `ngrams` argument for reference.\n",
    "4. Indexing tokens.\n",
    "5. Transforming each example using this index, either into a vector of ints or a dense float vector.\n",
    "\n",
    "Run the cells below to see this in action."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "CLFb7wXVTQXc"
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "# Sample inputs\n",
    "sentences = [\n",
    "    'i love my dog',\n",
    "    'I, love my cat'\n",
    "    ]\n",
    "\n",
    "# Initialize the layer\n",
    "vectorize_layer = tf.keras.layers.TextVectorization()\n",
    "\n",
    "# Build the vocabulary\n",
    "vectorize_layer.adapt(sentences)\n",
    "\n",
    "# Get the vocabulary list. Ignore special tokens for now.\n",
    "vocabulary = vectorize_layer.get_vocabulary(include_special_tokens=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FNmxq2MMYbDc"
   },
   "source": [
    "The resulting `vocabulary` will be a list where more frequently used words will have a lower index. By default, it will also reserve indices for special tokens but , for clarity, let's reserve that for later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "W-uJ4K_ts956"
   },
   "outputs": [],
   "source": [
    "# Print the token index\n",
    "for index, word in enumerate(vocabulary):\n",
    "  print(index, word)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9vFOmuRbZUes"
   },
   "source": [
    "If you add another sentence, you'll notice new words in the vocabulary and new punctuation is still ignored as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "VX1A1pDNoVKm"
   },
   "outputs": [],
   "source": [
    "# Add another input\n",
    "sentences = [\n",
    "    'i love my dog',\n",
    "    'I, love my cat',\n",
    "    'You love my dog!'\n",
    "]\n",
    "\n",
    "# Initialize the layer\n",
    "vectorize_layer = tf.keras.layers.TextVectorization()\n",
    "\n",
    "# Build the vocabulary\n",
    "vectorize_layer.adapt(sentences)\n",
    "\n",
    "# Get the vocabulary list. Ignore special tokens for now.\n",
    "vocabulary = vectorize_layer.get_vocabulary(include_special_tokens=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "NvM_N6J0tGAM"
   },
   "outputs": [],
   "source": [
    "# Print the token index\n",
    "for index, word in enumerate(vocabulary):\n",
    "  print(index, word)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VoUPdaR6bIO-"
   },
   "source": [
    "Now that you see how it behaves, let's include the two special tokens. The first one at `0` is used for padding and `1` is used for out-of-vocabulary words. These are important when you use the layer to convert input texts to integer sequences. You'll see that in the next lab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1NJNJZ8SQ3pM"
   },
   "outputs": [],
   "source": [
    "# Get the vocabulary list.\n",
    "vocabulary = vectorize_layer.get_vocabulary()\n",
    "\n",
    "# Print the token index\n",
    "for index, word in enumerate(vocabulary):\n",
    "  print(index, word)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GtY2P9HmkqWl"
   },
   "source": [
    "That concludes this short exercise on building a vocabulary!"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "private_outputs": true,
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
